With Applications in R
University of Utah, USA
2024-08-23
Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:
Big data: How to work with data that doesn't fit in your computer's memory
Parallel computing: How to take advantage of multi-core systems
Compiled code: Write your own low-level code (if R doesn't have it yet…)
(Check out the CRAN Task View on HPC)
High Throughput Computing Cluster (HTC Cluster)
Supercomputer (HPC Cluster)
Single Instruction, Multiple Data (SIMD)
In terms of scale
HTC > HPC > Single node > Socket > Core > Thread | SIMD vectorization
How many cores does your computer have?
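In R, a quick way to check (note the count includes logical cores/hyperthreads, so the number of physical cores may be lower):

```r
library(parallel)
# Number of (logical) cores the current machine reports
detectCores()
```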
Ask yourself these questions before jumping into HPC!
Things that are easily parallelizable: embarrassingly parallel tasks, e.g., bootstrapping, Monte Carlo simulations, or fitting the same model to many independent datasets.
Things that are not easily parallelizable: tasks where each step depends on the previous one, e.g., MCMC and other inherently sequential algorithms.
Parallelization is not free: most of the cost is in sending and receiving data.
In R (and other languages), you can mitigate this by (i) reducing the amount of data communicated and (ii) reducing the number of times you communicate.
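A minimal sketch of point (ii): fewer, larger tasks mean fewer round-trips to the workers (the cluster size and data here are arbitrary, for illustration only):

```r
library(parallel)
cl <- makeCluster(2L)
x  <- runif(1e4)

# One task per element: thousands of tiny messages to schedule
# ans_slow <- parLapply(cl, x, sqrt)

# One task per worker: the data travels in 2 big chunks instead
chunks   <- split(x, rep(1:2, each = 5e3))
ans_fast <- parLapply(cl, chunks, sqrt)

stopCluster(cl)
```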
Overhead cost of parallelization: Fitting \(y = \alpha + \beta_k X_k + \varepsilon,\quad k = 1, \dots\) (more about this later)
While there are several alternatives (just take a look at the High-Performance Computing Task View), we'll focus on the following R packages for explicit parallelism:
Some examples:
Implicit parallelism, on the other hand, refers to out-of-the-box tools that allow the programmer not to worry about parallelization, e.g., gpuR for matrix manipulation on the GPU, or tensorflow.
And there’s also a more advanced set of options
PSOCK, Fork, MPI, etc. (Usually) We do the following:

1. Create a PSOCK/FORK (or other) cluster using makePSOCKcluster/makeForkCluster (or makeCluster).
2. Copy/prepare each R session (if you are using a PSOCK cluster):
   - Copy objects with clusterExport
   - Pass expressions with clusterEvalQ
   - Set a seed
3. Do your call: parApply, parLapply, etc.
4. Stop the cluster with stopCluster.

| Type | Description | Pros | Cons |
|---|---|---|---|
| PSOCK | Multiple machines via socket connection | Works in all OSs | Slowest |
| FORK | Single machine via forking | Avoids memory duplication | Only for Unix-based systems |
| MPI | Multiple machines via the Message Passing Interface | Best alternative for HPC clusters | Sometimes hard to set up |
Using PSOCK, the slurmR package creates clusters spanning multiple nodes in HPC environments (think hundreds of cores).
# 1. CREATING A CLUSTER
library(parallel)
cl <- makePSOCKcluster(4)
x <- 20
# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # Reproducible RNG streams across workers
clusterExport(cl, "x")
# 3. DO YOUR CALL
clusterEvalQ(cl, {
paste0("Hello from process #", Sys.getpid(), ". x = ", x)
})
[[1]]
[1] "Hello from process #99444. x = 20"
[[2]]
[1] "Hello from process #99442. x = 20"
[[3]]
[1] "Hello from process #99445. x = 20"
[[4]]
[1] "Hello from process #99443. x = 20"
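One step the output above doesn't show: once you are done, release the worker processes with stopCluster (from the parallel package):

```r
library(parallel)
cl <- makePSOCKcluster(2L)
# ... your parallel calls here ...
# 4. STOP THE CLUSTER: frees the worker R sessions
stopCluster(cl)
```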
Problem: Run multiple regressions on a very wide dataset. We need to fit the following model:
\[ y = X_i\beta_i + \varepsilon,\quad \varepsilon\sim N(0, \sigma^2_i),\quad\forall i \]
[1] 500 999
x001 x002 x003 x004 x005
1 0.61827227 1.72847041 -1.4810695 -0.2471871 1.4776281
2 0.96777456 -0.19358426 -0.8176465 0.6356714 0.7292221
3 -0.04303734 -0.06692844 0.9048826 -1.9277964 2.2947675
4 0.84237608 -1.13685605 -1.8559158 0.4687967 0.9881953
5 -1.91921443 1.83865873 0.5937039 -0.1410556 0.6507415
6 0.59146153 0.81743419 0.3348553 -1.8771819 0.8181764
num [1:500] -0.8188 -0.5438 1.0209 0.0467 -0.4501 ...
cl <- parallel::makePSOCKcluster(4L)
ans <- parallel::parApply(
cl = cl,
X = X,
MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
)
ans[, 1:5]
                   x001        x002        x003        x004        x005
(Intercept) -0.03449819 -0.03339681 -0.03728140 -0.03644192 -0.03717344
x -0.06082548 0.02748265 -0.01327855 -0.08012361 -0.04067826
Are we going any faster?
library(microbenchmark)
microbenchmark(
parallel = parallel::parApply(
cl = cl,
X = X, MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
),
serial = apply(
X = X, MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
),
times = 10,
unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval cld
parallel 1.000000 1.00000 1.000000 1.000000 1.00000 1.00000 10 a
serial 2.752872 2.76052 2.743248 2.754032 2.82927 2.52011 10 b
Problem: We want to bootstrap a logistic regression model. We need to fit the following model:
\[ P(Y=1) = \text{logit}^{-1}\left(X\beta\right) \]
[1] 100 5
[,1] [,2] [,3] [,4] [,5]
[1,] -0.13592452 1.1921489 -1.04101654 0.26500638 -0.51561099
[2,] -0.04079697 -0.1231379 -0.43190705 1.38694989 0.39568325
[3,] 1.01053901 -0.5741648 -0.77781632 -0.29149014 -0.78301461
[4,] -0.15826244 -1.4903169 0.37368178 -1.83027672 0.88538861
[5,] -2.15663750 2.3638289 0.31256458 -1.62766978 -0.38212891
[6,] 0.49864683 -2.9510362 0.07122864 -0.01630346 0.05333596
[1] 1 1 0 0 1 0
my_boot <- function(y, X, B=1000) {
# Generating the indices
n <- length(y)
indices <- sample.int(n = n, size = n * B, replace = TRUE) |>
matrix(nrow = n)
# Fitting the model
apply(indices, 2, function(i) {
glm(y[i] ~ X[i,], family = binomial("logit")) |>
coef()
}) |> t()
}
set.seed(3312)
ans <- my_boot(y, X, B=50)
head(ans)
     (Intercept)     X[i, ]1  X[i, ]2   X[i, ]3  X[i, ]4    X[i, ]5
[1,] 2.943576 -0.46986617 2.292807 1.1069735 2.117947 0.7839228
[2,] 4.265760 -0.01445575 3.881603 2.5052960 4.300462 0.0542386
[3,] 2.702185 -0.40973910 2.315127 1.1693082 3.059388 0.2927383
[4,] 4.827939 -1.52854114 2.692226 1.7977035 4.370736 0.7825011
[5,] 3.229396 -0.56316370 1.980704 1.4054200 3.949632 0.2806117
[6,] 2.933971 0.25911455 2.193838 0.6953409 1.970649 -0.3528708
my_boot_pll <- function(y, X, cl, B=1000) {
# Generating the indices
n <- length(y)
indices <- sample.int(n = n, size = n * B, replace = TRUE) |>
matrix(nrow = n)
# Making sure y and X are available in the cluster
parallel::clusterExport(cl, c("y", "X"))
# Fitting the model
parallel::parApply(cl, indices, 2, function(i) {
glm(y[i] ~ X[i,], family = binomial("logit")) |>
coef()
}) |> t()
}
cl <- parallel::makeForkCluster(4)
set.seed(3312)
ans_pll <- my_boot_pll(y, X, cl, B=50)
head(ans_pll)
     (Intercept)     X[i, ]1  X[i, ]2   X[i, ]3  X[i, ]4    X[i, ]5
[1,] 2.943576 -0.46986617 2.292807 1.1069735 2.117947 0.7839228
[2,] 4.265760 -0.01445575 3.881603 2.5052960 4.300462 0.0542386
[3,] 2.702185 -0.40973910 2.315127 1.1693082 3.059388 0.2927383
[4,] 4.827939 -1.52854114 2.692226 1.7977035 4.370736 0.7825011
[5,] 3.229396 -0.56316370 1.980704 1.4054200 3.949632 0.2806117
[6,] 2.933971 0.25911455 2.193838 0.6953409 1.970649 -0.3528708
How much faster?
microbenchmark::microbenchmark(
parallel = my_boot_pll(y, X, cl, B=1000),
serial = my_boot(y, X, B=1000),
times = 1,
unit = "s"
)
Unit: seconds
expr min lq mean median uq max neval
parallel 0.1942538 0.1942538 0.1942538 0.1942538 0.1942538 0.1942538 1
serial 0.5178959 0.5178959 0.5178959 0.5178959 0.5178959 0.5178959 1
Problem: Revisiting the overhead cost of parallelization. We want to fit the following model: \[y = X_k\beta_k + \varepsilon,\quad k = 1, \dots\]
[,1] [,2] [,3] [,4] [,5]
[1,] -1.91835255 -0.2359106 -1.4642601 -0.5320349 -0.4639574
[2,] -0.09017806 -0.1022420 -0.6735899 1.6146947 -2.3792154
[3,] -1.25551672 1.2079800 0.2159515 -0.1323614 0.9867689
[4,] 1.28006769 -0.2806277 -0.2026345 -0.7375033 -0.1067501
[1] 0.3570720 0.4850507 1.0281664 -0.7044579 1.1378356 -0.9032009
Naive approach: run regressions and return the full output.
Problem: The lm() function returns a lot of information, and sending all of it back from the workers is costly:
List of 12
$ coefficients : Named num [1:2] 0.01075 0.00945
..- attr(*, "names")= chr [1:2] "(Intercept)" "X[, 1]"
$ residuals : Named num [1:10000] 0.364 0.475 1.029 -0.727 1.111 ...
..- attr(*, "names")= chr [1:10000] "1" "2" "3" "4" ...
$ effects : Named num [1:10000] -1.075 0.959 1.02 -0.725 1.115 ...
..- attr(*, "names")= chr [1:10000] "(Intercept)" "X[, 1]" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:10000] -0.00739 0.0099 -0.00112 0.02285 0.02688 ...
..- attr(*, "names")= chr [1:10000] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:10000, 1:2] -1e+02 1e-02 1e-02 1e-02 1e-02 1e-02 1e-02 1e-02 1e-02 1e-02 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:10000] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "X[, 1]"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.01 1
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 9998
$ xlevels : Named list()
$ call : language lm(formula = y ~ X[, 1])
$ terms :Classes 'terms', 'formula' language y ~ X[, 1]
.. ..- attr(*, "variables")= language list(y, X[, 1])
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "y" "X[, 1]"
.. .. .. ..$ : chr "X[, 1]"
.. ..- attr(*, "term.labels")= chr "X[, 1]"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(y, X[, 1])
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "y" "X[, 1]"
$ model :'data.frame': 10000 obs. of 2 variables:
..$ y : num [1:10000] 0.357 0.485 1.028 -0.704 1.138 ...
..$ X[, 1]: num [1:10000] -1.9184 -0.0902 -1.2555 1.2801 1.7068 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language y ~ X[, 1]
.. .. ..- attr(*, "variables")= language list(y, X[, 1])
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "y" "X[, 1]"
.. .. .. .. ..$ : chr "X[, 1]"
.. .. ..- attr(*, "term.labels")= chr "X[, 1]"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(y, X[, 1])
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "y" "X[, 1]"
- attr(*, "class")= chr "lm"
Let’s start with the naive approach: fitting the model and returning the full output.
library(parallel)
cost_serial <- system.time(apply(X, 2, function(x, y) lm(y ~ x), y = y))
# Running the benchmark
cl <- makePSOCKcluster(4)
cost_pll <- system.time(parApply(cl, X, 2, function(x, y) lm(y ~ x), y = y))
# Stopping the cluster
stopCluster(cl)
c(Serial = cost_serial["elapsed"], Parallel = cost_pll["elapsed"])
Serial.elapsed Parallel.elapsed
3.135 6.308
| Type | Serial | Parallel | Parallel (coef only) | Parallel fork (coef only) |
|---|---|---|---|---|
| Elapsed | 3.135 | 6.308 | 1.102 | 0.676 |
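The "coef only" timings correspond to returning just the coefficients instead of the full lm object, which shrinks the data sent back from the workers. A sketch of that variant, using toy data standing in for the wide dataset above (dimensions here are arbitrary):

```r
library(parallel)

# Toy data: 1e4 observations, 10 predictors (for illustration)
set.seed(1)
X <- matrix(rnorm(1e4 * 10), ncol = 10)
y <- rnorm(1e4)

cl <- makePSOCKcluster(4)
# Return only the two coefficients per column, not the full fit
ans_coef <- parApply(cl, X, 2, function(x, y) coef(lm(y ~ x)), y = y)
stopCluster(cl)

dim(ans_coef) # 2 coefficients x 10 regressions
```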
gvegayon
ggvy.cl
george.vegayon@utah.edu
For more, check out the CRAN Task View on HPC
We know that \(\pi = \frac{A}{r^2}\). We approximate it by randomly adding points \(x\) to a square of size 2 centered at the origin.
So, we approximate \(\pi\) as \(\Pr\{\|x\| \leq 1\}\times 2^2\)
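The pisim function itself is not defined in the slides; here is a minimal version reconstructed from the calls that follow (the name and signature are taken from the benchmark below, so treat it as an assumption):

```r
# Reconstructed from context: draw nsim points uniformly in the
# [-1, 1] x [-1, 1] square and scale the in-circle rate by 2^2.
# The first argument is an unused index so it works with sapply.
pisim <- function(i, nsim) {
  p <- matrix(runif(nsim * 2, -1, 1), ncol = 2)
  mean(sqrt(rowSums(p^2)) <= 1) * 4
}

set.seed(123)
pisim(1, nsim = 1e5) # close to 3.14
```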
The R code to do this
library(parallel)
# Setup
cl <- makePSOCKcluster(4L)
clusterSetRNGStream(cl, 123)
# Number of simulations we want each time to run
nsim <- 1e5
# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))
# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
microbenchmark::microbenchmark(
parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
serial = sapply(1:100, pisim, nsim=nsim),
times = 10,
unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval cld
parallel 1.000000 1.0000 1.00000 1.000000 1.000000 1.000000 10 a
serial 1.867702 1.8882 1.98731 1.907443 2.290021 2.120476 10 b
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin23.4.0
Running under: macOS Sonoma 14.6.1
Matrix products: default
BLAS: /opt/homebrew/Cellar/openblas/0.3.27/lib/libopenblasp-r0.3.27.dylib
LAPACK: /opt/homebrew/Cellar/r/4.4.1/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Denver
tzcode source: internal
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] microbenchmark_1.4.10
loaded via a namespace (and not attached):
[1] digest_0.6.36 codetools_0.2-20 multcomp_1.4-25 fastmap_1.2.0
[5] Matrix_1.7-0 xfun_0.45 lattice_0.22-6 TH.data_1.1-2
[9] splines_4.4.1 zoo_1.8-12 knitr_1.47 htmltools_0.5.8.1
[13] rmarkdown_2.27 mvtnorm_1.2-5 cli_3.6.3 grid_4.4.1
[17] sandwich_3.1-0 compiler_4.4.1 tools_4.4.1 evaluate_0.24.0
[21] survival_3.6-4 yaml_2.3.8 rlang_1.1.4 jsonlite_1.8.8
[25] MASS_7.3-60.2